feat: avro-sdc #18

sebastianswms · 2023-06-23T20:20:19Z

Feature additions:

- Avro file parsing
  - Multiple schema coercion strategies.
- _sdc columns
  - Information on each record's file name and line number.
  - Enable/disable with additional_info config.
- Header and footer configuration
  - header_skip and footer_skip to ignore the beginning and end of delimited files.
  - override_headers to allow headerless delimited files and/or renaming of existing headers.

Closes #5
Closes #11
Closes #12

Additional info in _sdc columns header_skip, footer_skip, override_headers

visch · 2023-06-23T20:34:06Z

README.md

 | quote_character             | False    | "       | The character used to indicate when a record in a CSV contains a delimiter character. |
+| header_skip                 | False    |       0 | The number of initial rows to skip at the beginning of each delimited file. |


This should be just for delimited files, right? Same for delimited, header_skip, footer_skip, and override_headers.

sebastianswms · 2023-06-27

Yes, it should only be for delimited files. I've prepended delimited_ to each name to make this clearer.

visch · 2023-06-23T20:35:36Z

README.md

-| delimiter                   | False    | detect  | The character used to separate records in a CSV/TSV. Can be any character or the special value `detect`. If a character is provided, all CSV and TSV files will use that value. `detect` will use `,` for  CSV files and `\t` for TSV files. |
+| file_type                   | False    | delimited | Can be any of `delimited`, `jsonl`, or `avro`. Indicates the type of file to sync, where `delimited` is for CSV/TSV files and similar. Note that *all* files will be read as that type, regardless of file extension. To only read from files with a matching file extension, appropriately configure `file_regex`. |
+| compression                 | False    | detect  | The encoding to use to decompress data. One of `zip`, `bz2`, `gzip`, `lzma`, `xz`, `none`, or `detect`. If set to `none` or any encoding, that setting will be applied to *all* files, regardless of file extension. If set to `detect`, encodings will be applied based on file extension. |
+| additional_info             | False    |       1 | If `True`, each row in tap's output will have two additional columns: `_sdc_file_name` and `_sdc_line_number`. If `False`, these columns will not be present. |


I think defaulting to True makes more sense here. Does it apply to all file types, though? For avro, I'm not sure it does 🤷‍♂️. We'd have to force some kind of implementation for each stream type; maybe some can just return n/a or something if it doesn't make sense.

sebastianswms · 2023-06-27

It already defaults to True. The row shows required=False, default=1; that's just how `--about --format=markdown` renders it.

I think it applies to all file types, or at least all the file types we have so far. For avro, it just increments _sdc_line_number for each record in a file and resets the count for each new file.
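The per-file counting described in this reply can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper name `add_sdc_columns`, its signature, and the choice to start counting at 1 are all assumptions made for the example.

```python
from typing import Iterable, Iterator


def add_sdc_columns(
    file_name: str,
    records: Iterable[dict],
    additional_info: bool = True,
) -> Iterator[dict]:
    """Yield each record, optionally tagged with its file name and position.

    Hypothetical helper: starting the count at 1 is an assumption.
    """
    # enumerate() restarts on every call, so the count resets for each file.
    for line_number, record in enumerate(records, start=1):
        if additional_info:
            yield {
                **record,
                "_sdc_file_name": file_name,
                "_sdc_line_number": line_number,
            }
        else:
            yield dict(record)


# Usage: two in-memory "files" standing in for decoded avro records.
files = {"a.avro": [{"id": 10}, {"id": 11}], "b.avro": [{"id": 20}]}
rows = [
    row
    for name, recs in files.items()
    for row in add_sdc_columns(name, recs)
]
```

Because the counter lives inside the per-file call, nothing file-type specific is needed: any reader that yields records one at a time gets the same two columns.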

visch · 2023-06-23T20:47:41Z

tap_file/streams.py

+        header_skip = self.config["header_skip"]
+        footer_skip = self.config["footer_skip"]
+
+        for reader_dict in self._get_readers():


This means that the header and footer are first being run through DictReader, right? I think this would cause issues for most of the use cases for this.

sebastianswms · 2023-06-27

I can see why. I've updated it to process delimited_header_skip and delimited_footer_skip before parsing with DictReader.

Refactor header_skip and footer_skip to process skipping before parsing.
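To illustrate why the order matters, here is a rough sketch of skipping rows before handing the remainder to `csv.DictReader`. It is not the implementation merged in this PR: the function `read_delimited` and its parameters are hypothetical, and it assumes the whole file fits in memory and that no quoted field contains an embedded newline.

```python
import csv
from typing import Iterator, Optional, Sequence


def read_delimited(
    text: str,
    header_skip: int = 0,
    footer_skip: int = 0,
    override_headers: Optional[Sequence[str]] = None,
    delimiter: str = ",",
) -> Iterator[dict]:
    """Drop leading/trailing lines first, then parse what is left."""
    lines = text.splitlines()
    # Slice on raw lines so junk rows never reach DictReader, which would
    # otherwise treat the first junk line as the header row.
    kept = lines[header_skip : len(lines) - footer_skip]
    # fieldnames=None makes DictReader use the first kept line as headers.
    reader = csv.DictReader(kept, fieldnames=override_headers, delimiter=delimiter)
    yield from reader


sample = "generated by export job\nid,name\n1,a\n2,b\ntotal,2\n"

# Skip the banner line and the trailing totals line.
rows = list(read_delimited(sample, header_skip=1, footer_skip=1))

# Rename headers: skip the banner AND the original header row.
renamed = list(
    read_delimited(
        sample, header_skip=2, footer_skip=1, override_headers=["key", "label"]
    )
)
```

In this sketch, renaming existing headers means skipping the original header row as well, since supplying `fieldnames` makes `DictReader` treat every remaining line as data.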

visch · 2023-07-12T03:19:57Z

Let's get this merged and the conflicts fixed, and then fix Pytest etc. Looks good to me.

sebastianswms added 2 commits June 23, 2023 16:13

Avro files

0f17988

Additional info in _sdc columns header_skip, footer_skip, override_headers

Updates from self code review.

306461d

sebastianswms marked this pull request as ready for review June 23, 2023 20:32

sebastianswms requested a review from visch June 23, 2023 20:32

sebastianswms marked this pull request as draft June 23, 2023 20:51

sebastianswms removed the request for review from visch June 23, 2023 20:51

visch requested changes Jun 23, 2023

Add tests

856dc41

Refactor header_skip and footer_skip to process skipping before parsing.

sebastianswms marked this pull request as ready for review June 27, 2023 17:04

sebastianswms requested a review from visch June 27, 2023 17:04

visch approved these changes Jul 12, 2023

visch merged commit e083c0d into MeltanoLabs:main Jul 12, 2023
